ABSTRACT

We present the sTeX+ system, a user-driven advancement
of sTeX, a semantic extension of LaTeX that allows for
producing high-quality PDF documents for (proof)reading
and printing, as well as semantic XML/OMDoc documents
for the Web or further processing. Originally, sTeX had
been created as an invasive, semantic frontend for authoring
XML documents. Here, we used sTeX in a Software
Engineering case study as a formalization tool. In order to
deal with modular pre-semantic vocabularies and relations,
we upgraded it to sTeX+ in a participatory design process.
We present a tool chain that starts with an sTeX+
editor and ultimately serves the generated documents as
XHTML+RDFa Linked Data via an OMDoc-enabled, versioned
XML database. In the final output, all structural
annotations are preserved in order to enable semantic
information retrieval services.

Categories and Subject Descriptors 

D.2.1 [Software Engineering]: Requirements/Specifications - Languages;
I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms
and Methods - Representation languages;
I.7.2 [Document and Text Processing]: Document Preparation

General Terms 

Documentation, Human Factors, Languages, Management 

Keywords 

formalization, LaTeX, Linked Data, software engineering,
semantic authoring, annotation, metadata, RDFa, vocabularies,
ontologies

1. INTRODUCTION 

An important issue in the Semantic Web community was and 
still is the "Authoring Problem": How can we convince peo- 
ple not only to use semantic technologies, but also prepare 
them for creating semantic documents (in a broad sense)? 



Here, we were interested in formalizing a collection of LaTeX
documents into a set of files in the OMDoc format, an XML
vocabulary specialized for managing mathematical information,
and further on to Linked Data for interactive browsing
and querying on the Semantic Web.

Concretely, the object of our study was the collection of documents
created in the course of the 3-year project "Sicherungskomponente
für Autonome Mobile Systeme (SAMS)" at the German Research Center
for Artificial Intelligence (DFKI). SAMS built a software safety
component for autonomous mobile service robots, and developed and
certified it as SIL-3 standard compliant (see [13]). Certification
required the software development to follow the V-model (figure 1)
and to be based on a verification of certain safety properties
in the proof checker Isabelle [33]. The V-model mandates, e.g.,
that relevant document fragments get justified and linked to
corresponding fragments in other members of the document collection
in an iterative refinement process (the arms of the 'V' from the
upper left over the bottom to the upper right and in-between in
figure 1).

Figure 1: A Document View on the V-Model

System development with respect to this regime results in a
highly interconnected collection of design documents, certification
documents, code, formal specifications, and formal
proofs. This collection of documents, "samsdocs" [35], makes
up the basis of a case study in the context of the FormalSafe
project [12] at DFKI Bremen, where it serves as a basis for
research on machine-supported change management, information
retrieval, and document interaction. In this paper,
we report on the formalization project of the collection of
LaTeX documents in samsdocs (which we will, without further
ado, also abbreviate as samsdocs).



Not surprisingly, the interplay between the fields Semantic
Web and Human-Computer Interaction played an important
role, as the "Authoring Problem" of the first is often tackled
via methods of the second. One such approach is that
of "invasive technology" [21], with the basic idea that from a
user's perspective, semantic authoring and general editing
are the same, so why not offer semantic functionalities as
an extension of well-known editing systems, thereby 'invading'
the existing ones. We started with LaTeX not only because
a good portion of our case study was written in it, but
also because LaTeX constitutes the state-of-the-art authoring
solution for many scientific/technical/mathematical document
collections. Despite its text-based nature it is widely considered
the most efficient tool for the task. Therefore, we used
the invasive OMDoc frontend for LaTeX documents called
sTeX [26]. In the formalization process its conceptual usability
weaknesses (for the task) were identified, and within
a participatory design process it evolved into the invasive
formalization tool sTeX+.

In section 2 we present the sTeX system, especially its
realization of Linked Data creation. Then we describe in
section 3 the formalization process of samsdocs with sTeX:
our challenges and our (pre-)solutions. In section 4 we report
the enhancements of sTeX realized in and for the case
study, which yielded sTeX+. Having sTeX+ documents with Linked
Data and ontological markup, we describe (potential) services
and their implementation design in section 5. Section 6
summarizes related work, and section 7 concludes the paper.

2. sTeX: OBJ.-ORIENTED LaTeX MARKUP

sTeX [26, 37] is an extension of the LaTeX language that
is geared towards marking up the semantic structure underlying
a document. The main concept in sTeX is that
of a "semantic macro", i.e., a TeX command sequence S
that represents a meaningful (mathematical) concept C: the
TeX formatter will expand S to the presentation of C. For
instance, the command sequence \positiveReals (from
listing 1) is a semantic macro that represents a mathematical
symbol, namely the set $\mathbb{R}^+$ of positive real numbers.
While the use of semantic macros is generally considered a good
markup practice for scientific documents (e.g., because they allow
adapting notation by macro redefinition and thus increase
reusability), regular TeX/LaTeX does not offer any infrastructural
support for this. sTeX does just this by adopting
a semantic, 'object-oriented' approach to semantic macros:
it groups them into "modules", which are linked by an
"imports" relation. To get a better intuition, consider

Listing 1: An sTeX Module for Real Numbers

\begin{module}[id=reals]
  \importmodule[../background/sets]{sets}
  \symdef{Reals}{\mathbb{R}}
  \symdef{greater}[2]{#1>#2}
  \symdef{positiveReals}{\Reals^+}
  \begin{definition}[id=posreals.def,
                     title=Positive Real Numbers]
    $\defeq\positiveReals
      {\setst{\inset{x}\Reals}{\greater{x}{0}}}$
  \end{definition}
\end{module}



which would be formatted to

Definition 2.1 (Positive Real Numbers):

    $\mathbb{R}^+ := \{x \in \mathbb{R} \mid x > 0\}$



Here, sTeX's \symdef macro, which is invasive due to its deliberate
resemblance to (La)TeX's \def and \newcommand, generates
a respective semantic macro, for instance
\positiveReals with representation $\mathbb{R}^+$. Note the symbol
inheritance scheme of sTeX: The markup in the module
reals has access to the semantic macros \setst ("set such
that") and \inset (set membership) from the module sets
that was imported by the \importmodule directive
from the file ../background/sets.tex. Furthermore, it
has access to the \defeq macro (definitional equality) that was in
turn imported by the module sets.
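
The inheritance scheme described above can be pictured as a transitive closure over the imports relation. The following Python fragment is only an illustrative sketch and not part of sTeX: the class `Module` and the function `visible_symbols` are hypothetical names, and the assumption that \defeq lives in a module called mathtalk is taken from the `cd="mathtalk"` attribute in listing 2.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical stand-in for an sTeX module: a name, its own symbols, its imports."""
    name: str
    symbols: set[str] = field(default_factory=set)
    imports: list[Module] = field(default_factory=list)


def visible_symbols(module: Module, _seen: set[str] | None = None) -> set[str]:
    """Collect the symbols a module defines or inherits via (recursive) imports."""
    seen = set() if _seen is None else _seen
    if module.name in seen:          # guard against cyclic imports
        return set()
    seen.add(module.name)
    result = set(module.symbols)
    for imported in module.imports:
        result |= visible_symbols(imported, seen)
    return result


# Assumed module graph: reals imports sets, sets imports mathtalk.
mathtalk = Module("mathtalk", {"defeq"})
sets = Module("sets", {"setst", "inset"}, [mathtalk])
reals = Module("reals", {"Reals", "greater", "positiveReals"}, [sets])

print(sorted(visible_symbols(reals)))
```

In this sketch the module reals sees \setst, \inset and \defeq without importing mathtalk itself, while the module sets does not see any symbol of reals.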



From this example we can already see an organizational advantage
of sTeX over LaTeX: we can define the (semantic)
macros close to where the corresponding concepts are defined,
and we can (recursively) import mathematical modules.
But the main advantage of markup in sTeX is that it
can be transformed to XML via the LaTeXML system.
Listing 2 shows the OMDoc [25] representation generated
from the sTeX sources in listing 1. OMDoc is a semantics-oriented
representation format for mathematical knowledge
that extends the formula markup formats OpenMath [7] and
MathML [2] to a document markup format.



Listing 2: An XML Version of Listing 1

     <theory xml:id="reals">
       <imports from="../background/sets.omdoc#sets"/>
       <symbol xml:id="Reals"/>
       <notation>
  5      <prototype><OMS cd="reals" name="Reals"/></prototype>
         <rendering><m:mo>ℝ</m:mo></rendering>
       </notation>
       <symbol xml:id="greater"/><notation>...</notation>
       <symbol xml:id="positiveReals"/><notation>...</notation>
 10    <definition xml:id="posreals.def" for="positiveReals">
         <meta property="dc:title">Positive Real Numbers</meta>
         <OMOBJ>
           <OMA>
             <OMS cd="mathtalk" name="defeq"/>
 15          <OMS cd="reals" name="positiveReals"/>
             <OMA>
               <OMS cd="sets" name="setst"/>
               <OMA>
                 <OMS cd="sets" name="inset"/>
 20              <OMV name="x"/>
                 <OMS cd="reals" name="Reals"/>
               </OMA>
               <OMA>
                 <OMS cd="reals" name="greater"/>
 25              <OMV name="x"/>
                 <OMI>0</OMI>
               </OMA>
             </OMA>
           </OMA>
 30      </OMOBJ>
       </definition>
     </theory>



One thing that jumps out from the XML in this listing
is that it incorporates all the information from the sTeX
markup that was invisible in the PDF produced by formatting
it with TeX.

OMDoc itself has been used as a storage and exchange format
for automated theorem provers, software verification
systems, e-learning software, and other applications [25, chapter 26],
but due to its focus on semantic structures, it is not
intended to be consumed by human readers. The Java-based
JOMDoc [19] library uses the notation elements to generate
human-readable XHTML+MathML from OMDoc. Figure 2
shows the result of rendering the document from listing 2
in a MathML-aware browser. In contrast to the PDF
output we can directly create from sTeX, XHTML+MathML
allows for interactivity. In particular, our JOBAD JavaScript
framework enables modular interactive services in
rendered XHTML+MathML documents [w]. These services
utilize the semantic structures of mathematical formulae. In
our rendered documents, each formula in human-readable
Presentation MathML carries the original semantic OpenMath
representation of the formula, as shown in listing 2, as
a hidden annotation.

Client-side JOBAD services, which exclusively rely on annotations
given inside a document, have already been implemented
for folding and unfolding subterms of formulae and
for controlling the display of redundant brackets in complex
formulae. The symbol definition lookup service, shown in
figure 2, interacts with a server backend: It traverses the
links to symbol elements and their corresponding definition
elements that are established by the OMS elements in OpenMath
(for example, <OMS cd="sets" name="inset"/>
encodes the URI ../background/sets.omdoc#inset)
and retrieves the document at that URI as XHTML+MathML
(see footnote 1). JOBAD's ability to integrate an arbitrary number of
services, which can talk to different server backends and
which are enabled depending on the context, i.e., the semantic
structure of the part of a mathematical formula that
the user has selected, turns our rendered mathematical documents
into powerful mashups [28]. On any symbol, for
example, definition lookup is enabled. On any expression
where a number is multiplied with a special symbol representing
a unit of measurement, a unit conversion client that
talks to a remote unit conversion web service is enabled. The
JOBAD architecture has been designed without depending
on a particular backend; for most of our services we are using
the extensible XML-aware database TNTBase [39, 40],
which has special support for OMDoc and integrates
the JOMDoc rendering library.
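
How an OMS element encodes the URI that the definition lookup service dereferences can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the JOBAD implementation: the function `oms_uri` and the table `CD_LOCATIONS` are hypothetical, and the mapping from the content dictionary name "sets" to ../background/sets.omdoc is assumed to come from the imports element in listing 2.

```python
import xml.etree.ElementTree as ET

# Assumed lookup table: content dictionary name -> OMDoc document that hosts it.
CD_LOCATIONS = {"sets": "../background/sets.omdoc"}


def oms_uri(oms_xml: str, cd_locations: dict[str, str]) -> str:
    """Build the symbol URI encoded by an <OMS cd="..." name="..."/> element."""
    oms = ET.fromstring(oms_xml)
    cd = oms.get("cd")
    name = oms.get("name")
    # The document of the content dictionary plus the symbol name as fragment.
    return f"{cd_locations[cd]}#{name}"


uri = oms_uri('<OMS cd="sets" name="inset"/>', CD_LOCATIONS)
print(uri)  # ../background/sets.omdoc#inset
```

A lookup service would then fetch the document at this URI from the backend; that step is omitted here because it depends on the server in use.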

[Figure 2: the rendered definition
$\mathbb{R}^+ := \{x \in \mathbb{R} \mid x > 0\}$ with a
"Definition Lookup (defeq)" popup that reads: "The simplest form of
definition schema is the simple definition. This just introduces a
name (the definiendum) for a compound object (the definiens). Note
that the name must be new, i.e. may not have been used for anything
else, in particular, the definiendum may not occur in the definiens.
We use the symbols := (and the inverse =:) to denote simple
definitions in formulae."]

Figure 2: Listing 2 as Dynamic XHTML+MathML



1 This is the MathML way of representing Linked Data. In
section 5 we describe how we have now extended this feature
to cover RDFa Linked Data.



3. FORMALIZATION WITH sTeX TOWARDS sTeX+

In this section we describe the process of formalizing the
SAMSDocs collection of LaTeX documents created in the course
of the SAMS project with the sTeX system. We use the user's
perspective to point to the requirements for sTeX+ that
evolved in this process.

As we all know all too well: Formalizing is never easily done.
In our project we had the additional challenge of doing it
without corrupting the PDF layout that was produced
with LaTeX. Here, sTeX fits well, as it generates PDF and
transforms to XML. In figure 3 we can see the general course
of action:

i) we identified document fragments ("objects") that constitute
a coherent, meaningful unit like the state of a
document "rd." or its description "ready for certification",
then

ii) we translated them into the sTeX format, realizing for example
that "rd." is a recurring symbol and "ready for
certification" its definition (therefore designing the samsdocs
macro "SDdef"), and finally

iii) we polished these macros in the sTeX-specific sty-files
so that the PDF layout remained as before and the
generated XML represented the intended logical structure,
for instance the use of the OMDoc XML elements
symbol and definition.

Note that definitions are common objects in mathematical
documents, therefore sTeX naturally provides a definition
environment. So why didn't we use that? Because the document
model of OMDoc, which we obtain by transforming
sTeX using LaTeXML, does not allow definitions in tables,
as the former are stand-alone objects from an ontological
perspective. If one authors a formal document, this view
is taken, so no problem arises, but if one formalizes an existing
document, layout and cognitive side-conditions have
to be taken into account. We therefore realized that we
could not simply add basic sTeX markup to the LaTeX source
to yield formal objects; we rather needed to add pre-formal
markup in the formalization process (we speak of (semantic)
preloading).

Whenever we discovered project-wide (semantic) layout schemes
that were frequently used, we extended the macro
set of sTeX suitably. This enabled preloading "project structures" [22],
i.e. project-induced ones, which are quite different from
"document [layout] structures" [ibid.] such as subsections, which
are supported by core features (see DCMsubsection in
figure 3). The table layout, for example, was often used for
lists of symbol definitions. So we created the SDTab-def
environment, which can host as many SDdef commands as
wanted (see figure 3). This increased the efficiency of the
formalizing process tremendously.

Another difference between authoring and semantic preloading
consisted in the order of the formalization steps. While
the order of the first typically consists of "chunking" (i.e.,
building up structure, e.g. by setting up theories), "spotting"
(i.e., coining objects), and "relating" (i.e., making relationships
between objects or structures explicit), the order of
the second is made up of spotting, then relating or chunking.
The last two were done simultaneously, because sTeX
offers a very handy inheritance scheme for symbol macros,
as long as the chunks are in order, which could be sensibly
done for some but not for all at this stage in the formalization
process. Generally, many 'guiding' services that sTeX
considered to be features turned out to be too rigid.

[Figure 3, panel "LaTeX":]

    \subsection{Document States}
    \renewcommand{\arraystretch}{1.5}
    \begin{tabular}[t]{ll}
    i.p. & in progress \\
    rd.  & ready for certification \\
    ct.  & certified \\
         & informative, will not be examined \\
    \end{tabular}

[Figure 3, panel "PDF":]

    Document States

    i.p.  in progress
    rd.   ready for certification
    ct.   certified
          informative, will not be examined

[Figure 3, panel "sTeX":]

    \DCMsubsection[id=sec.states-doc]{Document States}
    \renewcommand{\arraystretch}{1.5}
    \begin{SDTab-def}[alignment={t},format={ll},id=states-doc,
        context={sec.states-doc},hlineUp=flow,hlineDown=flow]
    \SDdef[symbol={i.p.},id=state-doc-ip,desc={in progress},hline=flow]
    \SDdef[symbol={rd.},id=state-doc-rd,desc={ready for certification},hline=flow]
    \SDdef[symbol={ct.},id=state-doc-ct,desc={certified},hline=flow]
    \SDdef[symbol={ },id=state-doc-info,desc={informative, will not be examined},hline=flow]
    \end{SDTab-def}

[Figure 3, panel "OMDoc" (excerpt, cut off in the figure):]

    <omgroup class="subsection" stex:srcref="documentPlan.tex#textrange(from=181;0,to=239;14)">
      <metadata stex:srcref="documentPlan.tex#textrange(from=181;0,to=181;60)">
        <dc:title>Document States</dc:title>
      </metadata>
      <theory stex:srcref="documentPlan.tex#textrange(from=185;0,to=190;15)" xml:id="states-doc.thy">
        <symbol name="state-doc-rd" xml:id="state-doc-rd.sym">
          <CMP xml:id="state-doc-rd.sym.p1">rd.</CMP>
        </symbol>
        <definition xml:id="state-doc-rd.def" for="state-doc-rd">
          <CMP>ready for certification</CMP>
        </definition>

Figure 3: The Formalization Workflow via sTeX: Definition Table of "document state"

As a consequence we heavily used very light annotations at
the beginning: It was sufficient to identify a certain document
fragment and to mark it with a referenceable ID like
"state-doc-rd". Shortly afterwards, we realized that some
more basic markup was necessary, since we wanted to formalize
our knowledge of types/categories of these objects
and their conceptual belonging. For this we developed a
set of "ad-hoc semantification macros" with named attributes
like SDobject[id], SDmore[id, cat, for],
SDisa[id, cat, for, follows, theory, imports, tab],
or SDreferences[id, file, refid] (see footnote 2). The 'more'
functionality provided by SDmore was required due to logically
contiguous objects that were interspersed in a document.
With this set we preloaded "object structures" [ibid.], i.e.
object-induced ones. Note that the ad-hoc semantification
macros enabled the formalizer to develop her own metadata
vocabulary.

As soon as the document boundaries went down, we real- 
ized that an object had many occurrences in several of the 
documents in the samsdocs collection. For example, first 



2 We use subsets of a general attributes set for all of our
sTeX extensions to lower the learning curve for the use of
the markup macros.



an object was introduced as a high-level concept in the contract,
then it was specified in another document, refined in
a detailed specification, implemented in the code, reviewed
at some stage, and so on until it was finally described in
the manual. Thus, we had to preload "collection structures"
[ibid.] as well, which consisted in the development
process model, the V-model as seen in figure 1. Here, we
built our personal V-model macros, e.g. SemVMrefines,
SemVMimplements, or SemVMdescribesUse.

Additionally, we created an sTeX extension especially suited
for preloading "organizational structures" [ibid.]. These are
considered different from project structures, as organizational
markup is very likely to be reusable for other projects
with the same organizational structures. For example, SAMS
used a document version management as well as a document
review history, so that the environments VMchangelist and
VMcertification with corresponding list entry macros
VMchange and VMcertified were built. Another example is
the processing state of a document, which can be marked up
easily by using the VMdocstate macro as seen in figure 4.

We noted that the necessary formalization depth of some
documents was naturally deeper than that of others. For example,
it didn't seem sensible to formalize the contract too much,
as it was created as a high-level communication document,
whereas the detailed specification needed a lot of formalization.
The manual had an interesting mixed state of formality
and informality, as it was again geared towards communication,
but it needed to be very precise. In conclusion we
note that the mathematical content of the documents (i.e.,
the mathematical objects and their relations) was only one
of the knowledge sources that needed to be formalized and
marked up. In the course of the formalization it has become
apparent that the knowledge in such complex collections
is multi-dimensional (cf. [22] for an in-depth analysis).
Thus, the requirements for extending sTeX to sTeX+
were (i) to generate XML output that preserves the semantics
annotated in the preloading phase, and (ii) to take into
account the multi-dimensionality of our ad-hoc semantification
macros in a way that technically enables browsing and
querying. These requirements were satisfied by enabling the
generation of RDFa from our annotations and making them
accessible to Linked Data services, as we will describe in the
following sections.

[Figure 4, panel "sTeX: Definiendum":]

    \begin{SDTab-def}[alignment={t},format={ll},id=states-doc, ...]
    \SDdef[symbol={i.p.},id=state-doc-ip,desc={in progress},...]
    \SDdef[symbol={rd.},id=state-doc-rd,desc={ready for certification},...]
    \SDdef[symbol={ct.},id=state-doc-ct,desc={certified},...]
    \end{SDTab-def}

[Figure 4, panel "LaTeX":]

    \begin{document}
    \docState{rd.}          :: document state

[Figure 4, panel "sTeX":]

    \begin{document}[id=braking-model]
    \VMdocstate{\SDreferencesNoObj[file=documentPlan,refid=state-doc-rd]{rd.}}

[Figure 4, panel "OMDoc" (via LaTeXML; attribute values cut off in the figure):]

    <omdoc xml:id="braking-model" about="#braking-model">
      <link rel="v:hasState">
        <resource typeof="sd:Reference">
          <link rel="sd:file" resource="docume..."/>
          <link rel="sd:refid" resource="docum...rd"/>
        </resource>
      </link>
    </omdoc>

Figure 4: Referencing a "document state"

4. sTeX+: A METADATA EXTENSION OF sTeX

All the arrows in figure 1 are examples of relations between
document fragments in the samsdocs corpus that needed to
be made explicit in addition to the mathematical relations
that sTeX had originally supported; the revision histories
of documents and the social networks of their authors constitute
further dimensions of knowledge. For situations like
these, we had incorporated RDFa [1] as a flexible metadata
framework into the OMDoc format [31]. In the course of
this case study, the RDFa integration was revised and extended
and will become part of the upcoming OMDoc version
1.3 [27]. The main idea for this integration is to realize
that any concrete document markup format can only treat a
certain set of objects and their relations via its respective native
markup infrastructure. All other objects and relations
can be added via RDFa annotations to the host language,
assuming the latter is XML-based.

It is crucial to realize that, for machine support, the metadata
objects and relations are given a machine-processable
meaning via suitable ontologies. Moreover, ontologies are
just special cases of (mathematical) theories, which import
appropriate theories for the logical background, e.g. description
logic, and whose symbols are the entities (classes, properties,
individuals) of ontologies. Thus, sTeX and OMDoc can
play a dual role for Linked Data in documents with mathematical
content. They can be used as markup formats for
the documents and at the same time as the markup formats
for the ontologies. We have explored this correspondence
for OMDoc in previous work and implemented a translation
between OMDoc and OWL [31, 30].



To understand our contribution, note that we can view LaTeX
and sTeX as frameworks for defining domain-specific vocabularies
in classes and packages; LaTeX is used for layout aspects,
and sTeX can additionally handle the semantic aspects
of the vocabularies. sTeX uses this approach to define
special markup, e.g. for definitions (see lines 10 to 31
in listing 2). Note that to define sTeX markup functionality
like the definition environment, we have to provide a
LaTeX environment definition (so that the formatting via
LaTeX works) and a LaTeXML binding (to specify the XML
transformation for the definition environment). As the
OMDoc vocabulary is finite and fixed, sTeX can (and does)
supply special LaTeX macros and environments and their
LaTeXML bindings. But the situation is different for the
flexible, RDFa-based metadata extension in OMDoc 1.3 we
mentioned above, with a potentially infinite supply of vocabularies.
At the start of the samsdocs preloading effort, sTeX
already supported a common subset of metadata vocabularies.
For instance, the Dublin Core title metadata element
in line 11 of listing 2 is the transformation result of using
the KeyVal [9] pair title=... in the optional argument of
the definition environment.

For the samsdocs case study we started in the same way
by adding a package with LaTeXML bindings to sTeX. The
\VMdocstate macro shown in the "sTeX" box of figure 4
allowed us to annotate a document with its processing state.
This is transformed to an RDFa-annotated omdoc root element,
as shown in the "OMDoc" box underneath and in
the black, solid parts of the RDF graph in figure 5. We can
already see that the sTeX extension for samsdocs consists
exactly in a domain-specific metadata vocabulary extension,
and that using the custom vocabulary hides markup
complexity from the author. Again, samsdocs only needed
a finite vocabulary extension, so this approach was feasible,
but of restricted applicability, since developing the samsdocs
package for sTeX required insights into sTeX internals and
LaTeXML bindings. Thus this extension approach lacks the
flexible user-extensibility that would be needed to scale up
further.

To enable user-extensibility, we add a new declaration form
\keydef to the core sTeX functionality (yielding sTeX+).
It is like \symdef in that it is inherited via the module imports
relation, only that it defines a KeyVal key instead of
a semantic macro. To understand its application, we rationally
reconstruct the v:hasState relation from the example
in the OMDoc box of figure 4. To do this, we use sTeX+ to
create a metadata vocabulary for document states: we create
a certification module, which defines the hasState
metadata relation and adds it to the KeyVal keys of the
document environment. The \metalanguage macro is a
variant of \importmodule that imports the meta language,
i.e., the language in which the meaning of the new symbols
is expressed; here we use OWL.



Listing 3: A Metadata Ontology for Certification

\begin{module}[id=certification]
  \metalanguage[../background/owl]{owl}
  \keydef{document}{hasState}
  \symdef{state-doc-rd}[1]{rd. #1}
  \symdef{tuev}{\text{T\"UV}}
  \begin{definition}[for=hasState]
    A document {\definiendum[hasState]{has state}} $x$, iff
    the project manager decrees it so.
  \end{definition}
  \begin{definition}[for=state-doc-rd]
    A document has state \definiendum[state-doc-rd]{rd. $x$}
    iff it has been submitted to $x$ for certification.
  \end{definition}
  \begin{definition}[for=tuev, hasState=$\statedocrd\tuev$]
    The $\tuev$ (Technischer \"Uberwachungsverein) is a
    well-known certification agency in Germany.
  \end{definition}
\end{module}



In this paper, we focus on using sTeX+ as a language
for defining lightweight vocabularies. Note, however, that
"heavyweight" formal semantics can be added to vocabulary
terms in the same way as has been shown for mathematical
symbols in listing 1. Just as the "real numbers" module
relies on an sTeX module that introduces set theory, the
certification ontology relies on an sTeX module that introduces
the OWL language. Such an OWL ontology that has been
written in sTeX+ can be translated to one of the widely
supported serializations of OWL via two paths: (i) In the
original workflow, the sTeX+ source is translated to OMDoc.
Thanks to their modularity and literate programming
capabilities, the sTeX+ or OMDoc representations allow
for an expressive documentation of OWL ontologies. But,
as OMDoc is not universally understood on the Semantic
Web, we have implemented a translation of OWL ontologies
encoded and documented in OMDoc to the standard RDF/XML
representation [31]. (ii) As an alternative to this previously
existing translation via OMDoc as an intermediate
representation, we are working on a direct sTeX+ to OWL
transformation. Simply using our experimental owl2onto
class [23] instead of the omdoc class from sTeX in the LaTeX
preamble will cause LaTeXML to generate OWL (here in the
direct OWL XML serialization) instead of OMDoc from a
subset of the sTeX+ markup.



Listing 4: Annotating a Document with Certification Metadata

\importmodule[../ontologies/cert]{certification}
\begin{document}[hasState=$\statedocrd{\tuev}$]

\end{document}



Let us now see how to use a vocabulary: If we import the
certification metadata module, we can write the annotation
in listing 4 to generate RDFa annotations that correspond to
the (red) dotted arrow in figure 5. Note that in the state of
formalization shown in figure 4, the SAMSDocs-specific RDF
vocabulary still has a pre-semantic structure. With sTeX+ we can
express that the processing state is actually intended to be a
symbol in a metadata theory, not just some semantic object
in some file. In listing 3 we use the \symdef directive to
generate the symbol state-doc-rd and \keydef to generate
a metadata relation hasState that is expressed by
a key of the same name, which is added to the document
environment. When processed by LaTeXML, \keydef takes
care of generating correct URIs for the metadata relations
and their target resources, resulting in an RDFa output
syntactically similar to figure 4. In conclusion, we note that
sTeX+ allows us to rationally recreate the effect we previously
achieved with the custom \VMdocstate and
\SDreferencesNoObj macros. Note that we did not have to
extend the LaTeXML bindings at all for this extension. Thus,
sTeX+ gives us a generic TeX→RDFa translation, which
works for arbitrary vocabularies (see footnote 3).

Figure 5: RDF View on a "doc. state" Assignment
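
The effect of \keydef can be pictured as a lookup from KeyVal options to RDF triples. The Python fragment below is a rough sketch under stated assumptions and does not reproduce the LaTeXML processing: the function `keydef_triples`, the two registries and the URI base ../ontologies/cert.omdoc are hypothetical (the base is chosen by analogy with ../background/sets.omdoc#inset), and the option value is simplified to a plain symbol name instead of the formula $\statedocrd{\tuev}$ used in listing 4.

```python
# Assumed URI base of the certification module (hypothetical, see lead-in).
BASE = "../ontologies/cert.omdoc"

# Keys declared via \keydef for the document environment, mapped to relation URIs.
KEY_REGISTRY = {"hasState": f"{BASE}#hasState"}

# Symbols declared via \symdef, mapped to resource URIs.
SYMBOL_REGISTRY = {"state-doc-rd": f"{BASE}#state-doc-rd", "tuev": f"{BASE}#tuev"}


def keydef_triples(subject, options, key_registry, symbol_registry):
    """Turn the KeyVal options of an environment into (subject, predicate, object) triples.

    Options that were not declared as metadata keys (e.g. id) are skipped.
    """
    triples = []
    for key, value in options.items():
        if key not in key_registry:
            continue
        triples.append((subject, key_registry[key], symbol_registry[value]))
    return triples


triples = keydef_triples(
    "#braking-model",
    {"id": "braking-model", "hasState": "state-doc-rd"},
    KEY_REGISTRY,
    SYMBOL_REGISTRY,
)
for triple in triples:
    print(triple)
```

The point of the sketch is that a new vocabulary only adds entries to the registries; the translation function itself stays unchanged, which mirrors the remark that the LaTeXML bindings did not have to be extended.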

5. sTeX+ DOCUMENTS AS LINKED DATA

The translation of classical sTeX to OMDoc and further
to XHTML+MathML (see section 2), which results in a
Linked Data like markup for mathematical symbols, enables
interactive services in mathematical formulae. Now
that sTeX+ supports formalization with arbitrary metadata
(cf. section 4), it should additionally be possible to
utilize these metadata for services. Both types of annotation
complement each other: A practical sTeX+ document,
like many of the SAMSDocs, would combine elements from
listing 4 with those from listing 1 and consequently rely on
services for both types of semantic structures.

The JOBAD service architecture (see section [5]) gives uniform
access to common queries in the document browsing
user interface. In the SAMSDocs scenario this might be a
query for all persons who have worked on the current document.
This can directly be answered from the metadata
of the revision log. Another typical query would consist in
asking for all parts of a specification that have to be re-certified.
Answering this query involves revision logs (for
finding documents that have changed since the last certification),
collection structures (V-model dependencies of
changed parts), and mathematical structures (logical dependencies).
In [22] we have elaborated on such SAMSDocs
queries from the point of view of their stakeholders (like
engineers, project managers, certifiers), particularly exploring
the multi-dimensionality of the formal structures. For
example, a project manager may find a substitute for an employee
E, who has implemented a specification, by tracing
back a link from the documentation of the implementation to
the specification document and finding out, from the metadata
of that document, who has recently been working on it.
Here, we will summarize the extensions made to our system
architecture to enable these services.
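
The re-certification query described above can be illustrated with a minimal sketch. It assumes that revision numbers and dependency links have already been extracted from the metadata; the file names, revision numbers, and the single merged depends_on relation are hypothetical and do not come from the SAMSDocs corpus.

```python
from collections import deque

# Hypothetical toy data (not taken from the SAMSDocs corpus).
last_certified_rev = 40
last_change_rev = {
    "spec.tex": 42,
    "design.tex": 38,
    "impl.tex": 39,
    "manual.tex": 12,
}

# depends_on[x] = items that x depends on. V-model and logical
# dependencies are merged into one relation for simplicity.
depends_on = {
    "design.tex": {"spec.tex"},
    "impl.tex": {"design.tex"},
    "manual.tex": set(),
}


def needs_recertification(last_change_rev, depends_on, last_certified_rev):
    """Return all items changed since the last certification, plus all
    items that depend on such an item directly or transitively."""
    # Invert the relation so we can walk from an item to its dependents.
    dependents = {}
    for item, targets in depends_on.items():
        for target in targets:
            dependents.setdefault(target, set()).add(item)

    changed = {d for d, rev in last_change_rev.items() if rev > last_certified_rev}
    affected = set(changed)
    queue = deque(changed)
    while queue:  # breadth-first traversal over dependents
        current = queue.popleft()
        for dep in dependents.get(current, ()):
            if dep not in affected:
                affected.add(dep)
                queue.append(dep)
    return affected


result = needs_recertification(last_change_rev, depends_on, last_certified_rev)
print(sorted(result))
```

In this toy setting only spec.tex changed after the certified revision, but design.tex and impl.tex are reported as well because they depend on it.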

3 Our experimental rdfameta package extends this to
arbitrary LaTeX documents: It redefines common LaTeX
commands (e. g. the sectioning macros) so that they include
optional KeyVal arguments that can be extended by
\keydef commands. With this metadata extension, we can
add RDFa metadata to any existing LaTeX document.

As a first step, we made the JOMDoc renderer preserve the
RDFa metadata from the OMDoc documents, now generating
XHTML+MathML+RDFa. Additionally, the mathematical
structures (those that are above the formula level)
had to be preserved in the rendered output. Even though
OMDoc uses native non-RDFa markup for these structures,
we can also represent these in RDF, exploiting the OMDoc
ontology (see [29, 11] for more information). Existing
JOBAD services recognized mathematical formulae in
XHTML presentations of OMDoc documents by their semantic
structure (e. g. whether they use previously defined
symbols or units of measurement). Similarly, new services
can now recognize from the RDFa annotations whether a
chunk of an XHTML document is, e. g., an implementation
of a specification fragment, and which user requirement
induced it. Compared to the previously existing definition
lookup service, the principle of retrieving content from
a target URI and displaying it in a popup remained the same;
the URIs are just provided by different annotations.

Secondly, we have extended the folding of subterms of mathematical
formulae to higher-level structures, such as requirements
or steps of structured proofs. We have implemented
this using the rdfQuery JavaScript library [38], which parses
all RDFa annotations of a document into a local triple store
that can be queried using SPARQL-like JavaScript functions.
On the server side, we have extended TNTBase [39],
our versioned database backend and web server/application
framework, to accept commits of sTeX+ documents, automatically
convert them to OMDoc, and then serve OMDoc,
XHTML+MathML+RDFa, and, optionally, RDF/XML, according
to the Linked Data best practices [17].
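
One common way of serving several representations of the same resource is HTTP content negotiation. The following sketch shows how a server might pick a representation from an Accept header. It is an illustration only: the text above does not say how TNTBase selects a format, the function names are hypothetical, and "application/omdoc+xml" is an assumed media type label.

```python
# Known representations; the OMDoc media type is an assumption.
REPRESENTATIONS = {
    "application/xhtml+xml": "xhtml",
    "application/rdf+xml": "rdf",
    "application/omdoc+xml": "omdoc",
}
DEFAULT = "xhtml"


def parse_accept(header):
    """Split an Accept header into (media_type, q) pairs, best first."""
    entries = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality value
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        entries.append((media_type, q))
    # sorted() is stable, so entries with equal q keep header order.
    return sorted(entries, key=lambda e: e[1], reverse=True)


def choose_representation(accept_header):
    """Return the name of the representation to serve."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return DEFAULT
        if media_type in REPRESENTATIONS:
            return REPRESENTATIONS[media_type]
    return DEFAULT


print(choose_representation("application/rdf+xml"))
print(choose_representation("application/omdoc+xml;q=0.5, application/xhtml+xml"))
```

A browser that accepts XHTML would thus receive the rendered document, while an RDF-aware client asking for RDF/XML would receive the triples.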

Even the pre-semantic annotations like the ones shown in 
figure [4] afford interactive services: A generic reference can 
already be utilized for lookup and navigation. Providing 
additional information in the instance document or in the 
ontology (e. g. the knowledge about the target of a reference 
being a symbol or a processing state) allows for making the 
service user interface more specific and enables the display 
of more relevant related information. For the generic pre-semantic
"references" relation, the list of all semantic objects
that it relates to each other would be too large to be
usable, as there is no obvious way of ranking or filtering
the link targets. But once more specific link types are used,
such as the "has state" link, that information can be used to
display a list of documents grouped by state.
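
The difference between the generic and the specific link type can be sketched with a tiny in-memory triple list. This is not the rdfQuery API; the match helper, the document URIs, and the state name state-doc-draft are hypothetical, and only hasState and state-doc-rd are taken from the example above.

```python
from collections import defaultdict

# Hypothetical triples as (subject, predicate, object).
triples = [
    ("doc/spec", "references", "doc/glossary"),
    ("doc/spec", "hasState", "state-doc-rd"),
    ("doc/design", "hasState", "state-doc-rd"),
    ("doc/impl", "hasState", "state-doc-draft"),
    ("doc/impl", "references", "doc/design"),
]


def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def group_by_state(triples):
    """Group document subjects by the object of their hasState link."""
    groups = defaultdict(list)
    for subject, _, state in match(triples, p="hasState"):
        groups[state].append(subject)
    return dict(groups)


groups = group_by_state(triples)
for state, docs in groups.items():
    print(state, docs)
```

The generic "references" triples are ignored by the grouping, which is exactly what makes the specific link type useful for a service.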

Queries across documents cannot be answered using the
above-mentioned rdfQuery: client-side queries require a
combination of querying a local triple store and crawling links.
In our setup, we have experimented with SQUIN [16], a frontend
to the Semantic Web Client library [1], which gives access
to Linked Data via a simple HTTP frontend at very low
integration costs: If the server provides standard-compliant
Linked Data, then the client simply has to access the URL
of the SQUIN server, providing a SPARQL query as a parameter.
An alternative would have been the AJAR library, a
part of the Tabulator Linked Data browser [3], which implements
the same functionality in JavaScript. In our test
setup, SQUIN acted as a proxy between the client-side JavaScript
code and our Linked Data. While a Linked Data
crawler is most flexible when data are distributed across
many servers (e. g. when an OMDoc document links to DBpedia),
its query answering capabilities are only as good as
the Linked Data being served. For example, if the RDF(a)
does not contain back-links (like links from a mathematical
theory to the theories it imports and to the theories by
which it is imported), then an AJAR- or SQUIN-powered
client cannot query links in both directions. Moreover, the
performance of such a solution is limited, as it requires memory
for the local triple store as well as processor time for query
answering on the client side. Therefore, in the SAMSDocs
setting, where the queries are currently limited to a document
collection on a single server, the best solution is storing
the triples on that same server, and making them accessible
via a standard query interface. Concretely, we make a
SPARQL endpoint powered by the Virtuoso triple store [34]
available as an extension to TNTBase [11]. In a larger Software
Engineering scenario (like a document collection of a
company with multiple departments) a combination with a
Linked Data crawler, as offered by the Sponger extension
to Virtuoso in an integrated server-side fashion, may have
advantages: if all these departments publish their document
collections as Linked Data in the company intranet (see for
instance [36] for the topicality of this example), crawling
these may reveal previously unknown connections, e. g. colleagues
dealing with structurally similar problems who could
lend advice. Note that local vocabularies resulting from ad-hoc
semantification need not be a barrier to knowledge exchange:
Linked Data practices recommend connecting occurrences
of semantically equivalent resources in different
data sets by owl:sameAs. Alternatively, if it turns out that
one department uses a "better" vocabulary for their data, the
sTeX+ metadata extensions make it easy to adopt it: all we
have to do is to change the sTeX+ bindings or \keydefs (see footnote 4).
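
Passing a SPARQL query as a URL parameter, as described above for the SQUIN frontend, can be sketched as follows. The service URL, the ex: vocabulary namespace, and the parameter name "query" are assumptions made for illustration; the parameter name follows the convention of the SPARQL protocol and may differ for a concrete SQUIN installation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical service location; replace with a real endpoint.
SERVICE_URL = "http://localhost:8080/squin/query"


def build_query_url(service_url, sparql, param="query"):
    """Append a percent-encoded SPARQL query to the service URL."""
    return service_url + "?" + urlencode({param: sparql})


# Hypothetical query: all documents in the state state-doc-rd.
sparql = (
    "PREFIX ex: <http://example.org/vocab#>\n"
    "SELECT ?doc WHERE { ?doc ex:hasState ex:state-doc-rd }"
)

url = build_query_url(SERVICE_URL, sparql)

# Round-trip check: decoding the URL yields the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == sparql)
```

The client-side code then only needs to issue an HTTP GET request for this URL; the same URL shape would also work against a server-side SPARQL endpoint.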

6. RELATED WORK 

We have presented sTeX+ as an extension of the LaTeX
language for both authoring Linked Data vocabularies and
annotating semantic documents with them. Thus, it is obviously
related to other semantic extensions of LaTeX. But,
when considering sTeX+ as a text- and macro-based frontend
to OWL and RDFa, it can also be compared to other
ontology/vocabulary authoring and document annotation
frontends, including such with graphical user interfaces.

SALT [15] also allows for annotating semantic relations in
LaTeX documents and exporting them as Linked Data. SALT
is restricted to a fixed set of rhetorical and bibliographical
relations, plus the metadata fields of widely used document
classes like LNCS, both of which it embeds as RDF
annotations in the generated PDF, whereas sTeX+ allows
for (re)using arbitrary relations plus defining custom ones.
The target format of sTeX+ is RDFa inside the generated
OMDoc and XHTML+MathML. We have concentrated on
that target, since it supports dynamic interactions via our
JOBAD system. An export of the metadata relations to
XMP annotations embedded in PDF should be possible with
the technology employed in SALT; we leave this to future
work.

4 Reuse of vocabularies is not limited by traditional restrictions
of TeX, which has a single global namespace for
macros, and where no two keys passed to a command or environment
may have the same name. sTeX groups symbols
into modules; sTeX+ does the same for keys. When two
symbols or keys that have the same local name relative
to their module are imported into another module M, there
are facilities for giving them distinct names for usage inside
M. For example, when there is already a key name, but
the name property from the FOAF ontology should also be
reused, we can set up a qualified import of the latter, e. g. as
FOAFname.

SOBOLEO [6] is a lightweight graphical user interface for
creating and editing vocabularies/ontologies in OWL based
on Web 2.0 tagging approaches. In [5], the authors evaluate
its usage along their "Ontology Maturing Process Model",
in which they confirm the succeeding phases "emergence of
ideas", "consolidation in communities", "formalization", and
"axiomatization" in an ontology engineering process. Our
observed phases of spotting, relating and chunking essentially
correspond, as the "emergence of ideas" period did not
apply (the documents were already created). Interestingly,
the "consolidation in communities" phase does not only have
to be thought of as a development time: We found it reified
in SAMSDocs, like the V-model relations. loomp is an example
of a WYSIWYG editor for annotating HTML documents
with terms from vocabularies, yielding RDFa [18]. GUI tools
traditionally separate the task of vocabulary creation from
document annotation; this also holds for SOBOLEO (responsible
for the former task) and loomp (responsible for
the latter). sTeX+, on the other hand, gives access to both
tasks via the same interface: TeX macros, which are once
declared, and once used - possibly even in the same source
file.

7. CONCLUSION AND FUTURE WORK 

We reported on a formalization case study, where we used
the sTeX format, a document formatting system and specification
platform for semantic, mathematical vocabularies,
on a document corpus from Software Engineering. To cope
with the multi-dimensional semantic structure implicit
in the document collection, we extended sTeX into a markup
platform for semi-formal ontologies and Linked Data called
sTeX+ (in our case semi-formal documents with RDFa-based
metadata annotations).

The key observation from our case study is that if we use
sTeX+ as a human- and document-oriented frontend for
Linked Data documents, we can approach the formalization
of semi-formal document collections as a process of "document
and ontology co-development", where (in our case
pre-existing) documents are semantically preloaded with inter-
and intra-document relations, whose meaning is given by
(project-specific or general, reusable) metadata ontologies.
As we have seen in section [3], preloading documents and
developing metadata ontologies in a joint frontend format
reduces formalization barriers. For instance, we often have
to elaborate informal document fragments into metadata
vocabularies; see the discussion about the "rd." document
state.

For practical applicability of the sTeX+ approach, machine
support for authoring and managing sTeX document collections
is crucial. As a client-side counterpart to the integrated
repository and Linked Data publishing solution provided by
TNTBase [11], we are currently developing an integrated
collection authoring environment sTeXIDE for sTeX on the
basis of the Eclipse framework [20]. We expect that extending
sTeXIDE to operationalize the sTeX+ functionality
presented in this paper will turn it into an IDE for document
collection and ontology co-development that will enable authors
to cope with the complexities of dealing with large collections
of semi-formalized documents. On the other hand,
we expect the modular sTeXIDE system to be a good basis
for deploying supportive services in a flexible document
collection environment.

We conjecture that the sTeX+-based workflow for document
and ontology co-development can be extended to arbitrary
Linked Data applications.